CS231A Course Project Milestone Report
Unsupervised Multi-Modal Feature Learning: Images and Text

Author

  • Maurizio Calo Caligaris
Abstract

Hand-engineering task-specific features for single modalities (e.g. vision) is a difficult and time-consuming task. Furthermore, the challenge gets significantly more pronounced when the data comes from multiple sources (e.g. images and text). In this work, we seek to leverage freely available images on the web, along with nearby text, to create meaningful feature representations that capture both visual and semantic information. Our hypothesis is that these learnt features can then be used to improve on many different computer vision features across a wide variety of computer vision tasks.


Related Articles

CS231A Project Milestone Sign Language Gesture Recognition with Unsupervised Feature Learning

This paper focuses on applying different segmentation approaches and unsupervised learning algorithms to create an accurate sign language recognition model. Future Distribution Permission: the author of this report gives permission for this document to be distributed to Stanford-affiliated students taking future courses.


Unsupervised Learning of Multimodal Features: Images and Text

In the following sections, we present the network architectures we use to learn bi-modal and cross-modal features. We describe an experimental setting which demonstrates that we are indeed able to learn features that effectively capture information from different modalities, and that we can further improve on computer vision features if we have other modalities (e.g. text) available during featur...
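As a rough illustration of the bi-modal idea above, below is a minimal sketch, in PyTorch, of an autoencoder that encodes image and text features into a shared hidden layer and reconstructs both modalities from it. This is not the author's actual architecture; the layer sizes, the ReLU/linear choices, and fusion by averaging are all illustrative assumptions.

# Hypothetical bi-modal autoencoder sketch; all dimensions are assumptions.
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, img_dim=1024, txt_dim=300, shared_dim=256):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(img_dim, shared_dim), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, shared_dim), nn.ReLU())
        self.img_dec = nn.Linear(shared_dim, img_dim)
        self.txt_dec = nn.Linear(shared_dim, txt_dim)

    def forward(self, img, txt):
        # Fuse the two modalities by averaging their encodings.
        h = 0.5 * (self.img_enc(img) + self.txt_enc(txt))
        return self.img_dec(h), self.txt_dec(h)

model = BimodalAutoencoder()
img = torch.randn(8, 1024)  # stand-in for precomputed image features
txt = torch.randn(8, 300)   # stand-in for bag-of-words or word-vector text features
img_rec, txt_rec = model(img, txt)
loss = nn.functional.mse_loss(img_rec, img) + nn.functional.mse_loss(txt_rec, txt)
loss.backward()

A cross-modal variant of the same sketch would drop one input at encoding time (e.g. encode images only) while still reconstructing both modalities, which encourages the image pathway to carry textual information.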


Learning a Semantic Space by Deep Network for Cross-media Retrieval

With the growth of multimedia data, the problem of cross-media (or cross-modal) retrieval has attracted considerable interest. One of the solutions is to learn a common representation for multimedia data. In this paper, we propose a simple but effective deep learning method to address the cross-media retrieval problem between images and text documents for ...
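To make the common-representation idea concrete, here is a hypothetical two-branch sketch in PyTorch, not the paper's actual network: each branch maps its modality into a shared space, and retrieval reduces to ranking by cosine similarity. The feature dimensions and branch depths are assumptions.

# Hypothetical two-branch embedding for cross-media retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

img_branch = nn.Sequential(nn.Linear(1024, 256), nn.Tanh())
txt_branch = nn.Sequential(nn.Linear(300, 256), nn.Tanh())

img_feats = torch.randn(16, 1024)  # gallery of image features
txt_feats = torch.randn(16, 300)   # text queries

img_emb = F.normalize(img_branch(img_feats), dim=1)
txt_emb = F.normalize(txt_branch(txt_feats), dim=1)

# Cross-media retrieval: rank all images for each text query by cosine similarity.
scores = txt_emb @ img_emb.t()
ranking = scores.argsort(dim=1, descending=True)

In practice the two branches would be trained so that paired image-text examples land close together in the shared space, e.g. with a ranking or correlation loss; the sketch shows only the retrieval step.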


Image-Text Multi-Modal Representation Learning by Adversarial Backpropagation

We present a novel method for image-text multi-modal representation learning. To our knowledge, this work is the first to apply the concept of adversarial learning to multi-modal learning without exploiting image-text pair information to learn multi-modal features. We use only category information, in contrast with most previous methods, which use image-text pair information for multi-modal embeddi...
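The adversarial ingredient can be sketched as follows: a modality discriminator tries to tell image embeddings from text embeddings, while a gradient-reversal layer pushes the two encoders to make them indistinguishable, and a shared classifier is supervised with category labels only. This is a hedged PyTorch illustration of that general recipe, not the authors' implementation; all sizes and the ten-class setup are assumptions.

# Hypothetical adversarial multi-modal sketch with gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad  # reversed gradient flows back into the encoders

img_enc = nn.Linear(1024, 128)
txt_enc = nn.Linear(300, 128)
modality_disc = nn.Linear(128, 2)  # predicts: image (0) or text (1)
classifier = nn.Linear(128, 10)    # shared category classifier

img_z = img_enc(torch.randn(8, 1024))
txt_z = txt_enc(torch.randn(8, 300))
z = torch.cat([img_z, txt_z], dim=0)

modality = torch.cat([torch.zeros(8), torch.ones(8)]).long()
labels = torch.randint(0, 10, (16,))  # category labels, the only supervision
adv_loss = nn.functional.cross_entropy(modality_disc(GradReverse.apply(z)), modality)
cls_loss = nn.functional.cross_entropy(classifier(z), labels)
(adv_loss + cls_loss).backward()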


Cross-modal Sound Mapping Using Deep Learning

We present a method for automatic feature extraction and cross-modal mapping using deep learning. Our system uses stacked autoencoders to learn a layered feature representation of the data. Feature vectors from two (or more) different domains are mapped to each other, effectively creating a cross-modal mapping. Our system can either run fully unsupervised, or it can use high-level labeling to f...
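A minimal sketch of that pipeline, under assumed feature dimensions, might look like the following; single-layer autoencoders stand in for the stacked ones described in the paper, and the mapping network is a single linear layer for brevity.

# Hypothetical autoencoder-plus-mapping sketch for two feature domains.
import torch
import torch.nn as nn

def make_autoencoder(dim, hidden):
    enc = nn.Sequential(nn.Linear(dim, hidden), nn.Sigmoid())
    dec = nn.Linear(hidden, dim)
    return enc, dec

# One autoencoder per domain (e.g. audio features and gesture features).
enc_a, dec_a = make_autoencoder(128, 32)
enc_b, dec_b = make_autoencoder(64, 32)
mapper = nn.Linear(32, 32)  # maps domain-A codes to domain-B codes

a = torch.randn(8, 128)
b = torch.randn(8, 64)

# Unsupervised stage: reconstruct each domain from its own code.
rec_loss = (nn.functional.mse_loss(dec_a(enc_a(a)), a)
            + nn.functional.mse_loss(dec_b(enc_b(b)), b))
# Mapping stage: align the two code spaces on corresponding examples.
map_loss = nn.functional.mse_loss(mapper(enc_a(a)), enc_b(b).detach())
(rec_loss + map_loss).backward()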




Publication year: 2011